Assignment 1

VAST Challenge 2021

Davmes Tan https://www.linkedin.com/in/davmestan/
07-13-2021

1. Background

1.1 Background Context

Extracted from VAST Challenge 2021 here

In the roughly twenty years that Tethys-based GAStech has been operating a natural gas production site in the island country of Kronos, it has produced remarkable profits and developed strong relationships with the government of Kronos. However, GAStech has not been as successful in demonstrating environmental stewardship.

In January, 2014, the leaders of GAStech are celebrating their new-found fortune as a result of the initial public offering of their very successful company. In the midst of this celebration, several employees of GAStech go missing. An organization known as the Protectors of Kronos (POK) is suspected in the disappearance, but things may not be what they seem.

1.2 The Challenge

Extracted from VAST Challenge 2021 Mini Challenge 2 here.

Many of the Abila, Kronos-based employees of GAStech have company cars which are approved for both personal and business use. Those who do not have company cars have the ability to check out company trucks for business use, but these trucks cannot be used for personal business.

Employees with company cars are happy to have these vehicles, because the company cars are generally much higher quality than the cars they would be able to afford otherwise. However, GAStech does not trust their employees. Without the employees’ knowledge, GAStech has installed geospatial tracking software in the company vehicles. The vehicles are tracked periodically as long as they are moving.

This vehicle tracking data has been made available to law enforcement to support their investigation. Unfortunately, data is not available for the day the GAStech employees went missing. Data is only available for the two weeks prior to the disappearance.

To promote local businesses, Kronos based companies provide a Kronos Kares benefit card to GASTech employees giving them discounts and rewards in exchange for collecting information about their credit card purchases and preferences as recorded on loyalty cards. This data has been made available to investigators in the hopes that it can help resolve the situation. However, Kronos Kares does not collect personal information beyond purchases.

The objective of this assignment is to assist law enforcement authorities to ascertain varying purchases made by specific GASTech employees and to identify suspicious patterns of behaviour.

2. Data Preparation

There are a total of 3 csv files provided for MC 2. They are:

There were a few issues that was needed to be resolved in both csv files.

cc_data <- read_csv("data/cc_data.csv")
loyalty_data <- read_csv("data/loyalty_data.csv")
employee_data <- read_csv("data/car-assignments.csv")

#convert timestamp from character into date/time format
cc_data$timestamp <- date_time_parse(cc_data$timestamp,
                                     zone = "",
                                     format = "%m/%d/%Y %H:%M")

#convert timestamp from character into date/time format
loyalty_data$timestamp <- date_time_parse(loyalty_data$timestamp,
                                     zone = "",
                                     format = "%m/%d/%Y")

#Amend string text for Katrina's Cafe
cc_data <- cc_data %>%
  mutate(location = str_replace_all(location,
                                    pattern = "Katerin.+",
                                    replacement = "Katrina\x27s Caf\xe9"))


loyalty_data <- loyalty_data %>%
  mutate(location = str_replace_all(location,
                                    pattern = "Katerin.+",
                                    replacement = "Katrina\x27s Caf\xe9"))
heatmap_cc <- cc_data %>%
  mutate(time60 = round_date(cc_data$timestamp, "60 minutes"),
         daydate = weekdays(timestamp),
         tempdate = timestamp + 8*60*60,
         weekend = chron::is.weekend(tempdate),
         time = format(time60, format = "%H:%M")) %>%
  select(-c(tempdate)) %>%
  group_by(location, daydate, time) %>%
  add_count(location, daydate, time, name = "count")

heatmap_cc_weekday <- heatmap_cc %>%
  filter(weekend == FALSE)

pop_heatmap_cc_weekday <- heatmap_cc %>%
  filter(location == "Abila Zacharo"|
           location =="Brew've Been Served" |
           location == "Gelatogalore" |
           location == "Guy's Gyros" |
           location == "Hallowed Grounds" |
           location == "Hippokampos" |
           location == "Katrina's Café" |
           location =="Ouzeri Elian")

heatmap_cc_weekend <- heatmap_cc %>%
  filter(weekend == TRUE)

pop_heatmap_cc_weekend <- heatmap_cc %>%
  filter(location == "Kalami Kafenion"|
           location == "Guy's Gyros" |
           location == "Hippokampos" |
           location == "Katrina's Café" |
           location =="Ouzeri Elian")

x1 <- length(unique(heatmap_cc_weekday$count))

cc_colours1 <- colorRampPalette(c('green', 'yellow', 'orange', 'red'))(x1)

p1 <- ggplot(heatmap_cc_weekday,
       aes(location, time)) + 
  geom_tile(aes(fill = factor(count))) + 
  scale_fill_manual(values = cc_colours1,
                    name = "Frequency") +
                    #breaks = levels(count)[seq(1, x, by = 5)]) +
  labs(x = "Locations", y = "Time (Static)", title = "Number of CC Transactions during Weekdays") +
  theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
        axis.text.y = element_text(size = 7),
        plot.title = element_text(hjust = 0.5))

x1a <- length(unique(pop_heatmap_cc_weekday$count))

cc_colours1a <- colorRampPalette(c('green', 'yellow', 'orange', 'red'))(x1a)

p1a <- ggplot(pop_heatmap_cc_weekday,
       aes(location, time)) + 
  geom_tile(aes(fill = factor(count))) + 
  scale_fill_manual(values = cc_colours1a,
                    name = "Frequency") +
                    #breaks = levels(count)[seq(1, x, by = 5)]) +
  labs(x = "Locations", y = "Time (Static)", title = "Number of CC Transactions during Weekdays") +
  theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
        axis.text.y = element_text(size = 7),
        plot.title = element_text(hjust = 0.5))

#svl <- "CC Time Weekday Heatmap.png"
#ggsave(svl)

x2 <- length(unique(heatmap_cc_weekend$count))

cc_colours2 <- colorRampPalette(c('green', 'yellow', 'orange', 'red'))(x2)

p2 <- ggplot(heatmap_cc_weekend,
       aes(location, time)) + 
  geom_tile(aes(fill = factor(count))) + 
  scale_fill_manual(values = cc_colours2,
                    name = "Frequency") + 
                    #breaks = levels(count)[seq(1, x, by = 5)]) +
  labs(x = "Locations", y = "Time (Static)", title = "Number of CC Transactions during Weekends") +
  theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
        axis.text.y = element_text(size = 7),
        plot.title = element_text(hjust = 0.5))

x2a <- length(unique(pop_heatmap_cc_weekend$count))

cc_colours2a <- colorRampPalette(c('green', 'yellow', 'orange', 'red'))(x2a)

p2a <- ggplot(pop_heatmap_cc_weekend,
       aes(location, time)) + 
  geom_tile(aes(fill = factor(count))) + 
  scale_fill_manual(values = cc_colours2a,
                    name = "Frequency") + 
                    #breaks = levels(count)[seq(1, x, by = 5)]) +
  labs(x = "Locations", y = "Time (Static)", title = "Number of CC Transactions during Weekends") +
  theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
        axis.text.y = element_text(size = 7),
        plot.title = element_text(hjust = 0.5))


#svl <- "CC Time Weekend Heatmap.png"
#ggsave(svl)

###################################################################
#By Date
###################################################################

heatmap_cc <- cc_data %>%
  mutate(time60 = round_date(cc_data$timestamp, "60 minutes"),
         daydate = weekdays(timestamp),
         tempdate = timestamp + 8*60*60,
         weekend = chron::is.weekend(tempdate),
         time = format(time60, format = "%H:%M"),
         date = format(timestamp, format = "%m/%d/%Y")) %>%
  select(-c(tempdate)) %>%
  group_by(location, date) %>%
  add_count(location, date, name = "count")

heatmap_cc_weekday <- heatmap_cc %>%
  filter(weekend == FALSE)

heatmap_cc_weekend <- heatmap_cc %>%
  filter(weekend == TRUE)

x1 <- length(unique(heatmap_cc_weekday$count))

cc_colours1 <- colorRampPalette(c('green', 'yellow', 'orange', 'red'))(x1)

p3 <- ggplot(heatmap_cc_weekday,
       aes(location, date)) + 
  geom_tile(aes(fill = factor(count))) + 
  scale_fill_manual(values = cc_colours1,
                    name = "Frequency") +
                    #breaks = levels(count)[seq(1, x, by = 5)]) +
  labs(x = "Locations", y = "Date", title = "Number of CC Transactions during Weekdays") +
  theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
        axis.text.y = element_text(size = 7),
        plot.title = element_text(hjust = 0.5))

#svl <- "CC Weekday Heatmap.png"
#ggsave(svl)

x2 <- length(unique(heatmap_cc_weekend$count))

cc_colours2 <- colorRampPalette(c('green', 'yellow', 'orange', 'red'))(x2)

p4 <- ggplot(heatmap_cc_weekend,
       aes(location, date)) + 
  geom_tile(aes(fill = factor(count))) + 
  scale_fill_manual(values = cc_colours2,
                    name = "Frequency") + 
                    #breaks = levels(count)[seq(1, x, by = 5)]) +
  labs(x = "Locations", y = "Date", title = "Number of CC Transactions during Weekends") +
  theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
        axis.text.y = element_text(size = 7),
        plot.title = element_text(hjust = 0.5))

#svl <- "CC Weekend Heatmap.png"
#ggsave(svl)
heatmap_loy <- loyalty_data %>%
  mutate(daydate = weekdays(timestamp),
         tempdate = timestamp + 8*60*60,
         weekend = chron::is.weekend(tempdate),
         date = format(timestamp, format = "%m/%d/%Y")) %>%
  select(-c(tempdate)) %>%
  group_by(location, date) %>%
  add_count(location, date, name = "count")

heatmap_loy_weekday <- heatmap_loy %>%
  filter(weekend == FALSE)

heatmap_loy_weekend <- heatmap_loy %>%
  filter(weekend == TRUE)

x1 <- length(unique(heatmap_loy_weekday$count))

cc_colours1 <- colorRampPalette(c('green', 'yellow', 'orange', 'red'))(x1)

p5 <- ggplot(heatmap_loy_weekday,
       aes(location, date)) + 
  geom_tile(aes(fill = factor(count))) + 
  scale_fill_manual(values = cc_colours1,
                    name = "Frequency") +
                    #breaks = levels(count)[seq(1, x, by = 5)]) +
  labs(x = "Locations", y = "Date", title = "Loyalty Card Transactions during Weekdays") +
  theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
        axis.text.y = element_text(size = 7),
        plot.title = element_text(hjust = 0.5))

#svl <- "Loyalty Weekday Heatmap.png"
#ggsave(svl)

x2 <- length(unique(heatmap_loy_weekend$count))

cc_colours2 <- colorRampPalette(c('green', 'yellow', 'orange', 'red'))(x2)

p6 <- ggplot(heatmap_loy_weekend,
       aes(location, date)) + 
  geom_tile(aes(fill = factor(count))) + 
  scale_fill_manual(values = cc_colours2,
                    name = "Frequency") + 
                    #breaks = levels(count)[seq(1, x, by = 5)]) +
  labs(x = "Locations", y = "Date", title = "Loyalty Card Transactions during Weekends") +
  theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
        axis.text.y = element_text(size = 7),
        plot.title = element_text(hjust = 0.5))

#svl <- "Loyalty Weekend Heatmap.png"
#ggsave(svl)

3. Question 1 - Analysing Credit Card and Loyalty Card Data

From the heat map visualisation, it can be observed that the following locations are largely popular during weekdays.

Similar observations were seen on both set of data in credit card and loyalty card transactions.

During weekends, the the popular locations are:

Similarly, observations were consistent on both set of data in credit card and loyalty card transactions.

To discover the popular periods for the locations, we could only use credit card transactions, since the timestap has a time element to it. This is what we discovered:

3.3.1 Weekdays

It was observed that a number of locations do not encounter a relatively higher peak throughout the day, except for the following locations:

During Breakfast (0730hrs to 0830hrs):

During Dinner (1930hrs to 2030hrs)

3.3.2 Weekends

During Lunch (1330hrs to 1430hrs)

During Dinner (1930hrs to 2030hrs)

Unlike Guy Gyros, Jippolampos and Katrina’s Cafe have longer periods of higher transaction volume.

num_locations <- cc_data %>%
  group_by(location) %>%
  distinct(location)

num_locations1 <- loyalty_data %>%
  group_by(location) %>%
  distinct(location)

missinglocation <- rbind(num_locations, num_locations1) %>%
  group_by(location) %>%
  add_count(location, name = "count") %>%
  filter(count == 1)

3.4 Anomalies in the data

3.4.1 Missing Location

It was noticed that there are 34 distinct locations in the credit card transactions and 33 distinct locations in the loyalty card transactions. A quick comparison was made, and it was noticed that Daily Dealz was not inside the loyalty card transaction. An inference could be that the location does not provide any benefits for the usage of GASTech loyalty card.

txn_cc_plot <- ggplot(cc_data, aes(location, price)) + 
  labs(x = "Locations", y = "Price", title = "Transaction Amount by Credit Card") +
  geom_boxplot_interactive(aes(tooltip = price),
                           stackgroups = TRUE) + 
  theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
        axis.text.y = element_text(size = 7),
        plot.title = element_text(hjust = 0.5))

txn_loy_plot <- ggplot(loyalty_data, aes(location, price)) + 
  labs(x = "Locations", y = "Price", title = "Transaction Amount by Loyalty Card") +
  geom_boxplot_interactive(aes(tooltip = price),
                           stackgroups = TRUE) + 
  theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
        axis.text.y = element_text(size = 7),
        plot.title = element_text(hjust = 0.5))

less_txn_cc_plot <- cc_data %>%
  filter(location != "Abila Airport" &
           location != "Abila Scrapyard" &
           location != "Albert's Fine Clothing" &
           location != "Carlyle Chemical Inc." &
           location != "Frydos Autosupply n' More" &
           location != "Kronos Pipe and Irrigation" &
           location != "Maximum Iron and Steel" &
           location != "Nationwide Refinery" &
           location != "Stewart and Sons Fabrication")

less_txn_cc_plot <- ggplot(less_txn_cc_plot, aes(location, price)) + 
  labs(x = "Locations", y = "Price", title = "Transaction Amount by Credit Card") +
  geom_boxplot_interactive(aes(tooltip = price),
                           stackgroups = TRUE) + 
  theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
        axis.text.y = element_text(size = 7),
        plot.title = element_text(hjust = 0.5))

less_txn_loy_plot <- loyalty_data %>%
  filter(location != "Abila Airport" &
           location != "Abila Scrapyard" &
           location != "Albert's Fine Clothing" &
           location != "Carlyle Chemical Inc." &
           location != "Frydos Autosupply n' More" &
           location != "Kronos Pipe and Irrigation" &
           location != "Maximum Iron and Steel" &
           location != "Nationwide Refinery" &
           location != "Stewart and Sons Fabrication")

less_txn_loy_plot <- ggplot(less_txn_loy_plot, aes(location, price)) + 
  labs(x = "Locations", y = "Price", title = "Transaction Amount by Credit Card") +
  geom_boxplot_interactive(aes(tooltip = price),
                           stackgroups = TRUE) + 
  theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
        axis.text.y = element_text(size = 7),
        plot.title = element_text(hjust = 0.5))

3.4.2 Transaction Anomalies

There are a series of transaction amount suspicious anomalies.

We will attempt to filter out the locations with high-paying transactions, to further observe if there are any more suspicious anomalies.

Another two transactions of $600 from Chostus Hotel and $477.60 from General Grocer could be suspicious anomalies.

suspicious_txn <- cc_data %>%
  filter(price == 4277.40 |
           price == 1239.41 |
           price == 10000 |
           price == 4429.76 |
           price == 4742.67 |
           price == 4513.16 |
           price == 600)

suspicious_txn1 <- loyalty_data %>%
  filter(price == 477.6)

The details of the suspicious transactions are as follows:

timestamp location price last4ccnum
2014-01-13 19:20:00 Frydos Autosupply n’ More 10000.00 9551
2014-01-16 11:25:00 Nationwide Refinery 4742.67 7792
2014-01-08 09:54:00 Nationwide Refinery 4513.16 9152
2014-01-14 16:28:00 Nationwide Refinery 4429.76 2276
2014-01-14 14:13:00 Abila Scrapyard 4277.40 2276
2014-01-17 19:44:00 Albert’s Fine Clothing 1239.41 1321
2014-01-18 12:03:00 Chostus Hotel 600.00 5010
timestamp location price loyaltynum
2014-01-18 General Grocer 477.6 L9362

4. Question 2 - Adding GPS and Vehicle Data into the Analysis

#Preparing the map
bgmap <- raster("data/Geospatial/MC2-tourist.tif")


Abila_st <- st_read(dsn = "data/Geospatial", 
                    layer = 'Abila')
Reading layer `Abila' from data source 
  `D:\MadwolfDT\DataViz_Blog\_posts\2021-07-13-assignment-1\data\Geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
Geodetic CRS:  WGS 84
gps <- read_csv("data/gps.csv")

gps$Timestamp <- date_time_parse(gps$Timestamp,
                                 zone = "",
                                 format = "%m/%d/%Y %H:%M:%S")

x <- gps %>%
  mutate(datestamp = as.Date(Timestamp + 60*60*8))

gps <- x

##convert
  gps_sf <- st_as_sf(gps,
                     coords = c("long", "lat"),
                     crs = 4326)

##string to gps path
  gps_path <- gps_sf %>%
    group_by(id) %>%
    summarize(m = mean(Timestamp),
              do_union = FALSE) %>%
    st_cast("LINESTRING")
  
#Discover top locations recorded
locations_gps <- gps 

#Discard the 5th decimal place and have an accuracy of 11.1m
locations_gps$lat <- round(locations_gps$lat, digits = 4)
locations_gps$long <- round(locations_gps$long, digits = 4)

4.1 Determine Places of Interests

In order to analyse the GPS data more efficiently and effectively, there is a need to determine possible Places of Interests (POIs) in Abila, Kronos. To do so, we can break them down into the following categories:

It was mentioned that GASTech company vehicles are installed with GPS and “the vehicles are tracked periodically as long as they are moving”. With that, as long as the vehicles are in stationary, the GPS would not be tracking the movement. With that, we could analyse the gps data and determine common Places of Interests (POI) by finding our the time lag between each data, grouped by the ID, and analyse all the lat/long coordinates that has a time lag of more than 3 minutes. Since we are analysing for POIs, the accuracy could be in the range of 11.1m, thus, we would only use lat/long up to 4 decimal points.

records_POI <- locations_gps %>%
  mutate(datestamp = as.Date(Timestamp + 60*60*8)) %>%
  group_by(id) %>%
  mutate(stop = Timestamp - lag(Timestamp)) %>%
  mutate(parked = ifelse(stop >60*3, TRUE,FALSE)) %>%
  mutate(lat111 = trunc(lat*1000)/1000,
         long111 = trunc(long*1000)/1000) %>%
  ungroup() %>%
  filter(parked == TRUE) %>%
  group_by(id, datestamp) %>%
  add_count(id, datestamp, name = "visitcount") %>%
  ungroup() %>%
  rename(timestamp = Timestamp)
records_POI <- records_POI[c(1,5,2,3,4,8,9,6,7,10)]

d_records_POI <- records_POI %>%
  distinct(lat,long, .keep_all = TRUE)

d_POI_tif_sf <- st_as_sf(d_records_POI, 
                   coords = c("long", "lat"), 
                   crs = 4326) %>%
  st_cast("POINT") #%>%

tmBase <- tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1, g = 2, b = 3,
         alpha = 0.5,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255)

tmrd_records_POI <- tmBase +
  tm_shape(d_POI_tif_sf) + 
  tm_dots(size = 0.3,
          alpha = 1,
          col = "black")

It was observed that there are 265 identified POIs in Abila. As mentioned, the POI may include homes and office. Therefore, we would attempt to segregate out the homes from these POIs. We will filter out locations, for those that had remained parked for 8hours or more.

home_POI <- records_POI %>%
  filter(stop >= 60*60*8) %>%
  distinct(id, lat, long, .keep_all = TRUE) %>%
  group_by(id) %>%
  arrange(id, lat, long) %>%
  mutate(near_lat = lat - lag(lat),
         near_long = long - lag(long),
         lat111 = trunc(lat*1000)/1000,
         long111 = trunc(long*1000)/1000) %>%
  mutate(drop = ifelse(is.na(near_lat),FALSE,
                       ifelse(between(near_lat, -0.0001,0.0001),
                              ifelse(between(near_long, -0.0001,0.0001), TRUE, FALSE),
                              ifelse(between(near_long, -0.0001,0.0001),TRUE, FALSE))))
############################################
#Aggregating ID = 28
############################################
x <- home_POI %>%
  filter(id != 28)

y <- home_POI %>%
  filter(id==28)

y$drop <- TRUE
y$drop <- ifelse(y$lat == 36.0732 & y$long == 24.8759,FALSE, y$drop)
y$drop <- ifelse(y$lat == 36.0524 & y$long == 24.8761,FALSE, y$drop)

x <- rbind(x, y)

home_POI <- x
##############################################

d_home_POI <- home_POI %>%
  filter(drop == FALSE)

d_home_POI_tif_sf <- st_as_sf(d_home_POI, 
                   coords = c("long", "lat"), 
                   crs = 4326) %>%
  select(-c(datestamp, lat111, long111,stop, parked, visitcount, near_lat, near_long,drop)) %>%
  st_cast("POINT") #%>%
  
x <- left_join(d_home_POI_tif_sf, d_home_POI, by = c("timestamp", "id")) %>%
  select(-c(datestamp, lat111, long111,stop, parked, visitcount, near_lat, near_long,drop))

d_home_POI_tif_sf <- x

tmBase_i <- tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1, g = 2, b = 3,
         alpha = 0.5,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255)

tm_home_POI_i <- tmBase +
  tm_shape(d_home_POI_tif_sf) + 
  tm_dots(size = 0.05,
          alpha = 1,
          col = "blue")

tm_home_POI <- tmBase +
  tm_shape(d_home_POI_tif_sf) + 
  tm_dots(size = 0.3,
          alpha = 1,
          col = "blue")

There are a total of 53 distinct locations which had GPS-marked vehicles parked for more than 8 hours. They can be a mixture of offices or homes. In addition, it was observed that Isande Borrasca, a Drill Technician with GASTech had numerous locations marked for staying for more than 8hours. That is because the GPS receiver inside his car is likely to have technical issues due to erractic location marking. However, since the general route could be observed, GPS tampering has been ruled out. As such, the 2 distinct locations that are identifed with Isande Borrasca are aggregated to the following:

gps_path_28 <- gps_sf %>%
  filter(id==28) %>%
  summarize(m = mean(Timestamp),
            do_union = FALSE) %>%
  st_cast("LINESTRING")

tmap_mode("plot")
x <- tmBase +
  tm_shape(gps_path_28) + 
  tm_lines(col = "blue")
x

d_home_or_office_POI <- d_home_POI %>%
  group_by(id) %>%
  add_count(id, name = "count") %>%
  ungroup() %>%
  filter(count >= 2) %>%
  select(-c(count))

d_cfm_home_POI <- d_home_POI %>%
  group_by(id) %>%
  add_count(id, name = "count") %>%
  ungroup() %>%
  mutate(category = "Home") %>%
  filter(count ==1 ) %>%
  filter(id < 100) %>% #remove trucks
  select(-c(count))

master_POI_list <- d_cfm_home_POI %>%
  select(-c(timestamp, datestamp, lat111, long111, parked, drop, visitcount, near_lat, near_long, stop))

x <- rbind(master_POI_list, c("GASTech", 36.0480, 24.8796, "Office"))
x$lat <- as.double(x$lat)
x$long <- as.double(x$long)

master_POI_list <- x

d_cfm_home_POI$category <- ifelse(d_cfm_home_POI$id > 100, "Office", d_cfm_home_POI$category)

distinct_home_or_office <- d_home_or_office_POI %>%
  group_by(lat, long) %>%
  add_count(lat, long, name = "count") %>%
  ungroup()

add_1 <- d_home_or_office_POI %>%
  filter(id == 1) %>%
  mutate(category = "Home")
add_1 <- add_1[-1,]

add_5 <- d_home_or_office_POI %>%
  filter(id == 5) %>%
  mutate(category = "Home")
add_5 <- add_5[-1,]

add_6 <- d_home_or_office_POI %>%
  filter(id == 6) %>%
  mutate(category = "Home")
add_6 <- add_6[-1,]

add_17 <- d_home_or_office_POI %>%
  filter(id == 17) %>%
  mutate(category = "Home")
add_17 <- add_17[-1,]

add_22 <- d_home_or_office_POI %>% #share residence with 30?
  filter(id == 22) %>%
  mutate(category = "Home")
add_22 <- add_22[-1,]

add_30 <- d_home_or_office_POI %>% #share residence with 22?
  filter(id == 30) %>%
  mutate(category = "Home")
add_30 <- add_30[-1,]

add_25 <- d_home_or_office_POI %>% #2 residences?
  filter(id == 25) %>%
  mutate(category = "Unknown")

add_9 <- d_home_or_office_POI %>% #5 places?
  filter(id == 9) %>%
  mutate(category = "Unknown")

add_21 <- d_home_or_office_POI %>% #2 places?
  filter(id == 21) %>%
  mutate(category = "Unknown")

add_28 <- d_home_or_office_POI %>% #2 places?
  filter(id == 28) %>%
  mutate(category = "Unknown")

x <- rbind(d_cfm_home_POI, add_1, add_17, add_21, add_22, add_25, add_28, add_30, add_5, add_6, add_9)

rm(add_1, add_17, add_21, add_22, add_25, add_28, add_30, add_5, add_6, add_9)

d_cfm_home_POI <- x

d_cfm_home_POI_tif_sf <- st_as_sf(d_cfm_home_POI, 
                   coords = c("long", "lat"), 
                   crs = 4326) %>%
  select(-c(datestamp, lat111, long111,stop, parked, visitcount, near_lat, near_long, drop)) %>%
  st_cast("POINT") #%>%
  
x <- left_join(d_cfm_home_POI_tif_sf, d_home_POI, by = c("timestamp", "id")) %>%
  select(-c(datestamp, lat111, long111,stop, parked, visitcount, near_lat, near_long,drop, timestamp))

x <- left_join(x, employee_data,
               by = c("id" = "CarID"))

x <- select(x, -c("CurrentEmploymentType"))

x <- x[c(1,8,7,6,4,5,2,3)]

d_cfm_home_POI_tif_sf <- x


tm_d_cfm_home_POI_i <- tm_shape(d_cfm_home_POI_tif_sf) + 
  tm_dots(size = 0.05,
          col = "category",
          palette = c("red", "blue", "green"))

tm_d_cfm_home_POI <- tm_shape(d_cfm_home_POI_tif_sf) + 
  tm_dots(size = 0.3,
          alpha = 1,
          col = "category",
          palette = c("red", "blue", "green"))

tm_d_cfm_home_POI_labels <- tm_shape(d_cfm_home_POI_tif_sf) + 
  tm_dots(size = 0.1,
          alpha = 0,
          col = "blue")
print_points_i <- function(df.x, start_time, end_time, dot_colour){
  
  st1 <- date_time_parse("2014-01-13 10:00", 
                       zone = "",
                       format = "%Y-%m-%d %H:%M")
  et1 <- date_time_parse("2014-01-13 15:00", 
                         zone = "",
                         format = "%Y-%m-%d %H:%M")
  
  x <- records_POI %>%
    filter(timestamp >= st1 & timestamp <=et1)
  
  x_tif_sf <- st_as_sf(x, 
                       coords = c("long", "lat"), 
                       crs = 4326) %>%
    st_cast("POINT") %>%
    select(-c(datestamp, lat111, long111, stop, parked, visitcount))
  
  tm.x <- tmBase_i +
    tm_shape(x_tif_sf) + 
    tm_dots(size = 0.05,
            alpha = 1,
            col = dot_colour)

}

save_emp_routes <- function(emp_id, query_date) {
  
  emp_home_point <- d_cfm_home_POI %>%
    filter(id == emp_id) %>%
    select(lat, long)
  
  emp_office_point <- d_cfm_home_POI %>%
    filter(category == "Office") %>%
    distinct(lat, long)
  
  emp_home_office_points <- rbind(emp_home_point, emp_office_point)
  
  emp_POIs <- records_POI %>%
    filter(id == emp_id) %>%
    filter(datestamp == query_date) %>%
    select(-c(visitcount)) %>%
    group_by(lat, long) %>%
    add_count(lat, long, name = "Number of Visits")
  
  emp_home_office_tif_sf <- st_as_sf(emp_home_office_points, 
                                     coords = c("long", "lat"), 
                                     crs = 4326) %>%
    st_cast("POINT") 
  
  emp_POIs_tif_sf <- st_as_sf(emp_POIs, 
                              coords = c("long", "lat"), 
                              crs = 4326) %>%
    select(-c(datestamp, lat111, long111, stop, parked, id)) %>%
    arrange(timestamp) %>%
    mutate(sequence = 1:n()) %>%
    st_cast("POINT") 
  
  emp_POIs_tif_sf <- emp_POIs_tif_sf[c(3,1,2,4)]
  
  query_path <- gps %>%
    filter(id == emp_id) %>%
    filter(datestamp == query_date)
    
  #convert to coordinates
  query_path_sf <- st_as_sf(query_path,
                     coords = c("long", "lat"),
                     crs = 4326)
  
  #string to gps path
  query_gps_path <- query_path_sf %>%
    summarize(m = mean(Timestamp),
              do_union = FALSE) %>%
    st_cast("LINESTRING")
  
  tmemp_home_office_i <- tm_shape(emp_home_office_tif_sf) + 
    tm_dots(size = 0.08,
            alpha = 1,
            col = "green")
  
  tmemp_home_office <- tm_shape(emp_home_office_tif_sf) + 
    tm_dots(size = 0.8,
            alpha = 1,
            col = "green")
  
  tmemp_POIs_i <- tm_shape(emp_POIs_tif_sf) + 
    tm_dots(size = 0.05,
            alpha = 1,
            col = "blue")
  
  tmemp_POIs <- tm_shape(emp_POIs_tif_sf) + 
    tm_dots(size = 0.5,
            alpha = 1,
            col = "blue")
  
  tmemp_gps_path_i <- tm_shape(query_gps_path) + 
    tm_lines()
  
  #prints non-interactive size
  tmOverall <- tmBase + tmemp_gps_path_i + tmemp_home_office + tmemp_POIs 
}

save_emp_routes_i <- function(emp_id, query_date) {
  
  emp_home_point <- d_cfm_home_POI %>%
    filter(id == emp_id) %>%
    select(lat, long)
  
  emp_office_point <- d_cfm_home_POI %>%
    filter(category == "Office") %>%
    distinct(lat, long)
  
  emp_home_office_points <- rbind(emp_home_point, emp_office_point)
  
  emp_POIs <- records_POI %>%
    filter(id == emp_id) %>%
    filter(datestamp == query_date) %>%
    select(-c(visitcount)) %>%
    group_by(lat, long) %>%
    add_count(lat, long, name = "Number of Visits")
  
  emp_home_office_tif_sf <- st_as_sf(emp_home_office_points, 
                                     coords = c("long", "lat"), 
                                     crs = 4326) %>%
    st_cast("POINT") 
  
  emp_POIs_tif_sf <- st_as_sf(emp_POIs, 
                              coords = c("long", "lat"), 
                              crs = 4326) %>%
    select(-c(datestamp, lat111, long111, stop, parked, id)) %>%
    arrange(timestamp) %>%
    mutate(sequence = 1:n()) %>%
    st_cast("POINT") 
  
  emp_POIs_tif_sf <- emp_POIs_tif_sf[c(3,1,2,4)]
  
  query_path <- gps %>%
    filter(id == emp_id) %>%
    filter(datestamp == query_date)
    
  #convert to coordinates
  query_path_sf <- st_as_sf(query_path,
                     coords = c("long", "lat"),
                     crs = 4326)
  
  #string to gps path
  query_gps_path <- query_path_sf %>%
    summarize(m = mean(Timestamp),
              do_union = FALSE) %>%
    st_cast("LINESTRING")
  
  tmemp_home_office_i <- tm_shape(emp_home_office_tif_sf) + 
    tm_dots(size = 0.08,
            alpha = 1,
            col = "green")
  
  tmemp_home_office <- tm_shape(emp_home_office_tif_sf) + 
    tm_dots(size = 0.8,
            alpha = 1,
            col = "green")
  
  tmemp_POIs_i <- tm_shape(emp_POIs_tif_sf) + 
    tm_dots(size = 0.05,
            alpha = 1,
            col = "blue")
  
  tmemp_POIs <- tm_shape(emp_POIs_tif_sf) + 
    tm_dots(size = 0.5,
            alpha = 1,
            col = "blue")
  
  tmemp_gps_path_i <- tm_shape(query_gps_path) + 
    tm_lines()
  
  #prints non-interactive size
  tmOverall <- tmBase + tmemp_gps_path_i + tmemp_home_office_i + tmemp_POIs_i 
}

#getting ALL the ids with positive home categorisation
emp_with_cfm_home <- d_cfm_home_POI %>%
  filter(category == "Home") %>%
  select(id)

#n <- nrow(emp_with_cfm_home)
#
#for(i in 1:n){
#  
#  print_map <- save_emp_routes(emp_with_cfm_home$id[i], "2014-01-06")
#  
#  #tmap_save(print_map, paste("06 Jan Route of ", emp_with_cfm_home$id[i], ".png", sep = ""))
#  
#}

It was determined that GASTech location is at (36.0480/ 24.8796) because there are high frequency counts by 6 of the employees, suggesting that the following 6 employees have stayed for more than 8 hours in office, at least once, during the past 2 weeks.

tmap_mode("view")
x <- tmBase_i + tm_d_cfm_home_POI_i + tm_d_cfm_home_POI_labels
x

From the map and table, it could be inferred that:

4.2 Investigation Transaction Anomalies with GPS Data

Now that we have determined the homes and office of various employees, we can proceed to investigate on the transaction anomalies.

4.2.1 $10,000 at Frydos Autosupply

txn10000_cc <- cc_data %>%
  filter(price == 10000)

txn10000_cc %>%
  kbl() %>%
  kable_styling()
timestamp location price last4ccnum
2014-01-13 19:20:00 Frydos Autosupply n’ More 10000 9551
txn10000_loy <- loyalty_data %>%
  filter(timestamp == "2014-01-13") %>%
  filter(location == "Frydos Autosupply n' More")

txn10000_loy %>%
  kbl() %>%
  kable_styling()
timestamp location price loyaltynum
2014-01-13 Frydos Autosupply n’ More 188.57 L8328
2014-01-13 Frydos Autosupply n’ More 64.60 L6110
2014-01-13 Frydos Autosupply n’ More 202.05 L9018
2014-01-13 Frydos Autosupply n’ More 87.57 L2169
txnFrydos <- cc_data %>%
  mutate(datestamp = as.Date(timestamp + 60*60*8)) %>%
  filter(datestamp == "2014-01-13") %>%
  filter(location == "Frydos Autosupply n' More")

txnFrydos %>%
  kbl() %>%
  kable_styling()
timestamp location price last4ccnum datestamp
2014-01-13 19:20:00 Frydos Autosupply n’ More 10000.00 9551 2014-01-13
2014-01-13 19:41:00 Frydos Autosupply n’ More 188.57 8129 2014-01-13
2014-01-13 19:59:00 Frydos Autosupply n’ More 64.60 8411 2014-01-13
2014-01-13 21:11:00 Frydos Autosupply n’ More 202.05 2418 2014-01-13

The transaction was transacted by somebody holding onto the card, 9551. And a check with the loyalty card data for all transactions on 13 Jan 2014 at Frydos Autosupply n’ More indicated that there are no transaction bearing $10,000. After crossing checking between credit card transactions and loyalty card transactions, only 2 transactions bear the exact price, suggesting the following:

We would go on to examine the routes for all the vehicles on 13 Jan and none of the route profile matches the spending pattern as suggested by the transactions. However, CarID 24 and 35 bore some resemblences and we would further examine.

card9551 <- cc_data %>%
  filter(last4ccnum == 9551) %>%
  mutate(datestamp = as.Date(timestamp+60*60*8)) %>%
  filter(datestamp == "2014-01-13") %>%
  select(-c(datestamp))

card9551 %>%
  kbl() %>%
  kable_styling()
timestamp location price last4ccnum
2014-01-13 06:04:00 Daily Dealz 2.01 9551
2014-01-13 13:18:00 U-Pump 55.25 9551
2014-01-13 13:28:00 Hippokampos 30.51 9551
2014-01-13 19:20:00 Frydos Autosupply n’ More 10000.00 9551
2014-01-13 19:30:00 Ouzeri Elian 28.75 9551
#for(i in 1:30){
#  
#  print_map <- save_emp_routes(i, "2014-01-13")
#  
#  tmap_save(print_map, paste("13 Jan Route of ", i, ".png", sep = ""))
#  
#}
#
#for(i in 32:35){
#  
#  print_map <- save_emp_routes(i, "2014-01-13")
#  
#  tmap_save(print_map, paste("13 Jan Route of ", i, ".png", sep = ""))
#  
#}
#
#print_map <- save_emp_routes(101, "2014-01-13")
#  
#tmap_save(print_map, paste("13 Jan Route of ", 101, ".png", sep = ""))
#
#for(i in 107:107){
#  
#  print_map <- save_emp_routes(i, "2014-01-13")
#  
#  tmap_save(print_map, paste("13 Jan Route of ", i, ".png", sep = ""))
#  
#}

4.2.1.1 Vehicle ID 24

st1 <- date_time_parse("2014-01-13 00:00", 
                       zone = "",
                       format = "%Y-%m-%d %H:%M")
et1 <- date_time_parse("2014-01-13 15:00", 
                       zone = "",
                       format = "%Y-%m-%d %H:%M")

x <- records_POI %>%
  filter(timestamp >= st1 & timestamp <=et1)

print.x <- print_points_i(x, st1, et1, "red")

tmap_mode("view")
print.x

It was observed that carID 24 was the only vehicle to have made a stop near U-Pump on 13 Jan before the transaction time at 13:18.

card24_1 <- print_map <- save_emp_routes_i(24, "2014-01-13")
card24_2 <- print_map <- save_emp_routes_i(24, "2014-01-14")

tmap_mode("view")

card24_1
card24_2
card24_POIs <- records_POI %>%
  filter(id == 24) %>%
  filter(datestamp == "2014-01-13"| datestamp == "2014-01-14") %>%
  select(-c(visitcount, lat111, long111)) %>%
  group_by(datestamp) %>%
  arrange(timestamp) %>%
  mutate("Sequence" = 1:n()) %>%
  ungroup() %>%
  select(-c(datestamp, id, stop, parked))

x <- card24_POIs %>%
  left_join(select(master_POI_list, category, lat, long),
            by = c("lat", "long"))

card24_POIs <- x

card24_POIs %>%
  kbl() %>%
  kable_styling()
timestamp lat long Sequence category
2014-01-13 07:32:01 36.0625 24.8988 1 NA
2014-01-13 08:07:01 36.0541 24.9012 2 NA
2014-01-13 11:16:01 36.0480 24.8796 3 Office
2014-01-13 11:46:01 36.0767 24.8576 4 NA
2014-01-13 12:31:01 36.0767 24.8576 5 NA
2014-01-13 13:22:01 36.0678 24.8715 6 NA
2014-01-13 17:57:01 36.0480 24.8796 7 Office
2014-01-13 19:29:01 36.0549 24.9018 8 NA
2014-01-14 03:20:01 36.0625 24.8989 1 NA
2014-01-14 07:47:01 36.0782 24.8721 2 Home
2014-01-14 08:18:01 36.0734 24.8642 3 NA
2014-01-14 11:19:01 36.0480 24.8796 4 Office
2014-01-14 11:52:01 36.0589 24.8928 5 NA
2014-01-14 12:17:01 36.0589 24.8928 6 NA
2014-01-14 14:04:01 36.0635 24.8510 7 NA
2014-01-14 17:46:01 36.0480 24.8796 8 Office
2014-01-14 18:54:01 36.0625 24.8988 9 NA
2014-01-14 19:02:01 36.0550 24.9018 10 NA
2014-01-14 20:41:01 36.0550 24.9019 11 NA

Observations from Vehicle 24.

While carID = 24 is suspicious, we are unable to conclusive link him to card 9551.

4.2.1.2 Vehicle ID 15

card35_1 <- print_map <- save_emp_routes_i(35, "2014-01-13")
card35_2 <- print_map <- save_emp_routes_i(35, "2014-01-14")

tmap_mode("view")

card35_1
card35_2
card35_POIs <- records_POI %>%
  filter(id == 35) %>%
  filter(datestamp == "2014-01-13"| datestamp == "2014-01-14") %>%
  select(-c(visitcount, lat111, long111)) %>%
  group_by(datestamp) %>%
  arrange(timestamp) %>%
  mutate("Sequence" = 1:n()) %>%
  ungroup() %>%
  select(-c(datestamp, id, stop, parked))

x <- card35_POIs %>%
  left_join(select(master_POI_list, category, lat, long),
            by = c("lat", "long"))

card35_POIs <- x

card35_POIs %>%
  kbl() %>%
  kable_styling()
timestamp lat long Sequence category
2014-01-13 06:46:01 36.0763 24.8747 1 NA
2014-01-13 06:59:01 36.0675 24.8733 2 NA
2014-01-13 12:18:01 36.0480 24.8796 3 Office
2014-01-13 13:41:01 36.0558 24.9026 4 NA
2014-01-13 17:54:01 36.0480 24.8796 5 Office
2014-01-13 19:39:01 36.0763 24.8747 6 NA
2014-01-13 19:51:01 36.0767 24.8576 7 NA
2014-01-13 20:36:01 36.0767 24.8576 8 NA
2014-01-14 06:38:01 36.0762 24.8747 1 Home
2014-01-14 07:00:01 36.0675 24.8734 2 NA
2014-01-14 07:09:01 36.0675 24.8733 3 NA
2014-01-14 12:37:01 36.0480 24.8796 4 Office
2014-01-14 13:36:01 36.0598 24.8580 5 NA
2014-01-14 18:01:01 36.0480 24.8796 6 Office
2014-01-14 19:09:01 36.0762 24.8747 7 Home
2014-01-14 20:26:01 36.0767 24.8576 8 NA

While CarID 35 may have the stops similar to the transactor of $10,000, the timestamp for the stops do not coincede with the transactions. In addition, the routes for the 2 days do not have any anomalies sighted.

4.3 Detecting Anomalies from Movement

Now that we have identified the specific POIs, we can now turn our focus to identify anomalies in the movement of the vehicles.

gps_employee <- left_join(gps, employee_data,
                          by = c("id" = "CarID"))
  
x <- gps_employee %>%
  mutate(time = as_hms(Timestamp)) %>%
  group_by(id, datestamp) %>%
  mutate(count = n()) %>%
  ungroup()

group1_gps_employee <- x %>% filter(id >= 1 & id <= 9) %>%
  mutate(name = paste(FirstName, LastName, sep =" "))
group2_gps_employee <- x %>% filter(id >= 10 & id <= 18) %>%
  mutate(name = paste(FirstName, LastName, sep =" "))
group3_gps_employee <- x %>% filter(id >= 19 & id <= 27)%>%
  mutate(name = paste(FirstName, LastName, sep =" "))
group4_gps_employee <- x %>% filter(id >= 28 & id <= 35)%>%
  mutate(name = paste(FirstName, LastName, sep =" "))
truck_gps_employee <- x %>% filter(id >= 101 & id <= 107)%>%
  mutate(name = paste(FirstName, LastName, sep =" "))

moving_chart1 <- ggplot(group1_gps_employee,
                       aes(label = lat, label2 = long)) +
  facet_wrap(.~name) +
  geom_point(aes(time, datestamp)) + 
  labs(x = "Time", y = "Date", title = "Vehicle Moving Plot") +
  theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
        axis.text.y = element_text(size = 6),
        plot.title = element_text(hjust = 0.5))

moving_chart2 <- ggplot(group2_gps_employee,
                       aes(label = lat, label2 = long)) +
  facet_wrap(.~name) +
  geom_point(aes(time, datestamp)) + 
  labs(x = "Time", y = "Date", title = "Vehicle Moving Plot") +
  theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
        axis.text.y = element_text(size = 6),
        plot.title = element_text(hjust = 0.5))

moving_chart3 <- ggplot(group3_gps_employee,
                       aes(label = lat, label2 = long)) +
  facet_wrap(.~name) +
  geom_point(aes(time, datestamp)) + 
  labs(x = "Time", y = "Date", title = "Vehicle Moving Plot") +
  theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
        axis.text.y = element_text(size = 6),
        plot.title = element_text(hjust = 0.5))

moving_chart4 <- ggplot(group4_gps_employee,
                       aes(label = lat, label2 = long)) +
  facet_wrap(.~name) +
  geom_point(aes(time, datestamp)) + 
  labs(x = "Time", y = "Date", title = "Vehicle Moving Plot") +
  theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
        axis.text.y = element_text(size = 6),
        plot.title = element_text(hjust = 0.5))

moving_chart5 <- ggplot(truck_gps_employee,
                       aes(label = lat, label2 = long)) +
  facet_wrap(.~name) +
  geom_point(aes(time, datestamp)) + 
  labs(x = "Time", y = "Date", title = "Vehicle Moving Plot") +
  theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
        axis.text.y = element_text(size = 6),
        plot.title = element_text(hjust = 0.5))

From the charts, a couple of anomalies were observed by the following employees:

We will examine them in detail.

gps_employee <- left_join(gps, employee_data,
                          by = c("id" = "CarID"))
  
x <- gps_employee %>%
  mutate(time = as_hms(Timestamp)) %>%
  group_by(id, datestamp) %>%
  mutate(count = n()) %>%
  ungroup()

detail_group1_gps_employee <- x %>% filter(id == 1 |
                                           id == 16 |
                                           id == 28 |
                                           id == 9 |
                                           id == 8 |
                                           id == 5) %>%
  mutate(name = paste(FirstName, LastName, sep =" "))
detail_group2_gps_employee <- x %>% filter(id == 15 |
                                           id == 21 |
                                           id == 26 |
                                           id == 24 |
                                           id == 19 |
                                           id == 29) %>%
  mutate(name = paste(FirstName, LastName, sep =" "))

detail_moving_chart1 <- ggplot(detail_group1_gps_employee,
                       aes(label = lat, label2 = long)) +
  facet_wrap(.~name) +
  geom_point(aes(time, datestamp)) + 
  labs(x = "", y = "", title = "Vehicle Moving Plot") +
  theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
        axis.text.y = element_text(size = 6),
        plot.title = element_text(hjust = 0.5))

detail_moving_chart2 <- ggplot(detail_group2_gps_employee,
                       aes(label = lat, label2 = long)) +
  facet_wrap(.~name) +
  geom_point(aes(time, datestamp)) + 
  labs(x = "", y = "", title = "Vehicle Moving Plot") +
  theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
        axis.text.y = element_text(size = 6),
        plot.title = element_text(hjust = 0.5))

#removing the interactive to prevent longer load
#ggplotly(detail_moving_chart1)
#ggplotly(detail_moving_chart2)
detail_moving_chart1
detail_moving_chart2

5. Question 3 - Inferring Owners of Each Credit Card with Loyalty Card

##Merging cc_data and loyalty_data 

day_cc_data <- cc_data %>%
  mutate(day = get_day(timestamp))

day_loyalty_data <- loyalty_data %>%
  mutate(day = get_day(timestamp))

cc_loyalty_data <- merge(day_cc_data, day_loyalty_data,
                         by = c("day", "location", "price"),
                         all.x = TRUE)

data_1286 <- cc_loyalty_data %>%
  filter(last4ccnum == 1286)

#finding duplicated data caused by merge
#Reason 1: 1 cc card user uses 2 loyalty cards
#Reason 2: 1 loyalty card holder uses 2 cc cards
cc_loyalty_data <- cc_loyalty_data %>%
    add_count(day, location, price, name = "count")

#filter out all those with duplicates done by merging
cc_loyalty_data <- cc_loyalty_data %>%
  filter(count <=1)

#some cc transactions did not use loyalty card
#we remove them first
cc_loyalty_data <- cc_loyalty_data %>%
  filter(!is.na(loyaltynum))

#having a table matching last4ccnum and loyaltynum
cc_loyalty_binding <- cc_loyalty_data %>%
  distinct(last4ccnum, loyaltynum)

#check for more than 1 cc
cc_loyalty_binding <- cc_loyalty_binding %>%
  add_count(last4ccnum, name = "cc_count")

#remove the cc duplicate
cc_loyalty_binding <- cc_loyalty_binding %>%
  filter(cc_count <=1)

#check for more than 1 loyalty card
cc_loyalty_binding <- cc_loyalty_binding %>%
  add_count(loyaltynum, name = "lo_count")

#remove the loyalty card duplicate
cc_loyalty_binding <- cc_loyalty_binding %>%
  filter(lo_count <=1)

cc_loyalty_data <- cc_data %>%
  mutate(loyaltynum = "")

#updating the loyalty number with loyalty_cc binding
cc_loyalty_data$loyaltynum <- cc_loyalty_binding$loyaltynum[match(cc_loyalty_data$last4ccnum, cc_loyalty_binding$last4ccnum)]

#find unidentified loyalty_data
without_loyalty_data <- cc_loyalty_data %>%
  filter(is.na(loyaltynum)) %>%
  distinct(last4ccnum)

without_loyalty_data <- without_loyalty_data$last4ccnum

cc_data_without_loyalty <- cc_data %>%
  filter(last4ccnum == without_loyalty_data[1]|
         last4ccnum == without_loyalty_data[2]|
         last4ccnum == without_loyalty_data[3])

cc_data_without_loyalty <- cc_data_without_loyalty %>%
  mutate(day = get_day(timestamp))

shared_cc_loyalty <- merge(cc_data_without_loyalty, day_loyalty_data,
           by = c("day", "location", "price"),
                         all.x = TRUE)

#discovered 1 user could have used 2 different types of credit cards
shared_cc_loyalty <- shared_cc_loyalty %>%
  filter(!is.na(loyaltynum)) %>%
  distinct(last4ccnum, loyaltynum)

##########################################
#At this stage, we found out the following:
#1286 uses 2 loyalty cards, L3288 and L3572
#L6267 uses 2 credit cards, 6691 and 6889
##########################################

#adjust the cc_loyalty_binding to remove duplicates
cc_loyalty_binding <- cc_loyalty_binding %>%
  filter(loyaltynum != "L3288")

cc_loyalty_binding <- cc_loyalty_binding %>%
  select(-c(cc_count, lo_count))

cc_loyalty_binding <- rbind(cc_loyalty_binding, shared_cc_loyalty)

colnames(cc_loyalty_binding)[which(names(cc_loyalty_binding) == "last4ccnum")] <- "Credit Card"
colnames(cc_loyalty_binding)[which(names(cc_loyalty_binding) == "loyaltynum")] <- "Loyalty Card"

#########################################
#Attentping to find transaction that matches but are distinctly different
#########################################

questionable_data <- merge(day_cc_data, day_loyalty_data,
                         by = c("day", "location", "price"),
                         all.x = TRUE)

#finding duplicated data caused by merge
#Reason 1: 1 cc card user uses 2 loyalty cards
#Reason 2: 1 loyalty card holder uses 2 cc cards
questionable_data <- questionable_data %>%
    add_count(day, location, price, name = "count")

#filter out all those with duplicates done by merging
questionable_data <- questionable_data %>%
  filter(count >1)

questionable_loyalty1 <- loyalty_data %>%
  filter(location == "Katrina's Café" &
           price == 26.60 &
           get_day(timestamp) == 9)

questionable_loyalty2 <- loyalty_data %>%
  filter(location == "Guy's Gyros" &
           price == 8.23 &
           get_day(timestamp) == 9)

questionable_loyalty3 <- loyalty_data %>%
  filter(location == "Hippokampos" &
           price == 63.21 &
           get_day(timestamp) == 11)

problem_data <- rbind(questionable_loyalty1, questionable_loyalty2, questionable_loyalty3)

cc_questionable1 <- cc_data %>%
  filter(location == "Katrina's Café" &
           price == 26.60 &
           get_day(timestamp) == 9)

cc_questionable2 <- cc_data %>%
  filter(location == "Guy's Gyros" &
           price == 8.23 &
           get_day(timestamp) == 9)

cc_questionable3 <- cc_data %>%
  filter(location == "Hippokampos" &
           price == 63.21 &
           get_day(timestamp) == 11)

cc_problem_data <- rbind(cc_questionable1, cc_questionable2, cc_questionable3)
DT::datatable(cc_loyalty_binding, 
              options = list(
                autoWidth = FALSE,
                columnDefs = list(list(width = '1px',
                                       className = 'dt-center',
                                       targets = c(0,1,2))))) %>%
  formatStyle(0,
              target = 'row',
              lineHeight='75%')

The above table would be able to infer the credit card holder to the specific loyalty card number. However, there were uncertainties in the data. For instance, credit card transactions timestamp were indicated with date and time while loyalty card transactions were indicated with date only. There may be occasions where the date, price and location may match between both sets of data but they were different transactions for the day.

The transactions involved are:

There were 2 transactions at each location bearing the same price at the same date inside loyalty data. However, we were unsure which transaction from each dataset belongs to which. And therefore, these observations would be excluded.

It was observed that:

We can see that owner of 1286 used two different loyalty cards at various instances.

data_1286 <- data_1286 %>%
  select(-c(timestamp.y, day))

colnames(data_1286)[which(names(data_1286) == "timestamp.x")] <- "timestamp"

col_order <- c("timestamp",
               "location", "price",
               "last4ccnum", "loyaltynum")
data_1286 <- data_1286[, col_order]

DT::datatable(data_1286, 
              options = list(
                autoWidth = FALSE,
                columnDefs = list(list(width = '1px',
                                       className = 'dt-center',
                                       targets = c(0:5)))))